[mlir][gpu] Use `known_block_size` to set `maxntid` for NVVM target #77301

grypp · 2024-01-08T12:45:38Z

Setting the thread block size with `maxntid` on the kernel has great performance benefits: it lets the downstream PTX compiler do better register allocation.

MLIR's `gpu.launch` and `gpu.launch_func` already have an attribute (`known_block_size`) that records the thread block size when it is known. This PR simply uses this attribute to set `maxntid`.
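The mapping from the known block size to the `maxntid` value can be sketched without any MLIR dependency. This is only an illustration of the fill-in rule used in the patch (missing dimensions become 1, and nothing is emitted when no dimension is known); the names `Dim`, `blockSizeForMaxntid` and `exampleUsage` are hypothetical and do not exist in MLIR.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the per-dimension result of
// gpuFuncOp.getKnownBlockSize(...); not an MLIR type.
using Dim = std::optional<int32_t>;

// Returns the three values that would be attached as the block size
// attribute, or std::nullopt when no dimension is known, in which case
// no attribute would be added at all.
std::optional<std::array<int32_t, 3>> blockSizeForMaxntid(Dim x, Dim y, Dim z) {
  if (!x && !y && !z)
    return std::nullopt;
  // Any dimension that is missing is filled in with 1.
  return std::array<int32_t, 3>{x.value_or(1), y.value_or(1), z.value_or(1)};
}

// Usage sketch: only x is known, so y and z default to 1.
void exampleUsage() {
  auto dims = blockSizeForMaxntid(128, std::nullopt, std::nullopt);
  assert(dims.has_value());
  assert((*dims)[0] == 128 && (*dims)[1] == 1 && (*dims)[2] == 1);
}
```

In the patch itself the three values are wrapped with `rewriter.getI32ArrayAttr` and attached under the attribute name passed to `GPUFuncOpLowering`.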

llvmbot · 2024-01-08T12:46:07Z

@llvm/pr-subscribers-mlir-gpu

Author: Guray Ozen (grypp)

Changes

Setting the thread block size with `maxntid` on the kernel has great performance benefits: it lets the downstream PTX compiler do better register allocation.

MLIR's `gpu.launch` and `gpu.launch_func` already have an attribute (`known_block_size`) that records the thread block size when it is known. This PR simply uses this attribute to set `maxntid`.

Full diff: https://github.com/llvm/llvm-project/pull/77301.diff

4 Files Affected:

(modified) mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp (+19-1)
(modified) mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h (+9-4)
(modified) mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp (+3-1)
(modified) mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir (+9)

diff --git a/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp b/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
index 6a005e67ca95ba..eeb8fbbb180bad 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
+++ b/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
@@ -85,8 +85,26 @@ GPUFuncOpLowering::matchAndRewrite(gpu::GPUFuncOp gpuFuncOp, OpAdaptor adaptor,
   // Add a dialect specific kernel attribute in addition to GPU kernel
   // attribute. The former is necessary for further translation while the
   // latter is expected by gpu.launch_func.
-  if (gpuFuncOp.isKernel())
+  if (gpuFuncOp.isKernel()) {
     attributes.emplace_back(kernelAttributeName, rewriter.getUnitAttr());
+
+    // Set the block size attribute if it is present.
+    if (kernelBlockSizeAttributeName.has_value()) {
+      std::optional<int32_t> dimX =
+          gpuFuncOp.getKnownBlockSize(gpu::Dimension::x);
+      std::optional<int32_t> dimY =
+          gpuFuncOp.getKnownBlockSize(gpu::Dimension::y);
+      std::optional<int32_t> dimZ =
+          gpuFuncOp.getKnownBlockSize(gpu::Dimension::z);
+      if (dimX.has_value() || dimY.has_value() || dimZ.has_value()) {
+        // If any of the dimensions are missing, fill them in with 1.
+        attributes.emplace_back(
+            kernelBlockSizeAttributeName.value(),
+            rewriter.getI32ArrayAttr(
+                {dimX.value_or(1), dimY.value_or(1), dimZ.value_or(1)}));
+      }
+    }
+  }
   auto llvmFuncOp = rewriter.create<LLVM::LLVMFuncOp>(
       gpuFuncOp.getLoc(), gpuFuncOp.getName(), funcType,
       LLVM::Linkage::External, /*dsoLocal=*/false, /*cconv=*/LLVM::CConv::C,
diff --git a/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h b/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h
index a77db4a036bad3..471a688e85463e 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h
+++ b/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h
@@ -36,13 +36,15 @@ struct GPUDynamicSharedMemoryOpLowering
 };
 
 struct GPUFuncOpLowering : ConvertOpToLLVMPattern<gpu::GPUFuncOp> {
-  GPUFuncOpLowering(const LLVMTypeConverter &converter,
-                    unsigned allocaAddrSpace, unsigned workgroupAddrSpace,
-                    StringAttr kernelAttributeName)
+  GPUFuncOpLowering(
+      const LLVMTypeConverter &converter, unsigned allocaAddrSpace,
+      unsigned workgroupAddrSpace, StringAttr kernelAttributeName,
+      std::optional<StringAttr> kernelBlockSizeAttributeName = std::nullopt)
       : ConvertOpToLLVMPattern<gpu::GPUFuncOp>(converter),
         allocaAddrSpace(allocaAddrSpace),
         workgroupAddrSpace(workgroupAddrSpace),
-        kernelAttributeName(kernelAttributeName) {}
+        kernelAttributeName(kernelAttributeName),
+        kernelBlockSizeAttributeName(kernelBlockSizeAttributeName) {}
 
   LogicalResult
   matchAndRewrite(gpu::GPUFuncOp gpuFuncOp, OpAdaptor adaptor,
@@ -56,6 +58,9 @@ struct GPUFuncOpLowering : ConvertOpToLLVMPattern<gpu::GPUFuncOp> {
 
   /// The attribute name to use instead of `gpu.kernel`.
   StringAttr kernelAttributeName;
+
+  /// The attribute name to to set block size
+  std::optional<StringAttr> kernelBlockSizeAttributeName;
 };
 
 /// The lowering of gpu.printf to a call to HIP hostcalls
diff --git a/mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp b/mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
index e60fe5cbd7603f..a7ac2332961ae2 100644
--- a/mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
+++ b/mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
@@ -352,7 +352,9 @@ void mlir::populateGpuToNVVMConversionPatterns(LLVMTypeConverter &converter,
       /*workgroupAddrSpace=*/
       static_cast<unsigned>(NVVM::NVVMMemorySpace::kSharedMemorySpace),
       StringAttr::get(&converter.getContext(),
-                      NVVM::NVVMDialect::getKernelFuncAttrName()));
+                      NVVM::NVVMDialect::getKernelFuncAttrName()),
+      StringAttr::get(&converter.getContext(),
+                      NVVM::NVVMDialect::getMaxntidAttrName()));
 
   populateOpPatterns<math::AbsFOp>(converter, patterns, "__nv_fabsf",
                                    "__nv_fabs");
diff --git a/mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir b/mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir
index 20a200e812c125..c7f1d4f124c186 100644
--- a/mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir
+++ b/mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir
@@ -627,6 +627,15 @@ gpu.module @test_module_31 {
   }
 }
 
+gpu.module @gpumodule {
+// CHECK-LABEL: func @kernel_with_block_size()
+// CHECK: attributes {gpu.kernel, gpu.known_block_size = array<i32: 128, 1, 1>, nvvm.kernel, nvvm.maxntid = [128 : i32, 1 : i32, 1 : i32]} 
+  gpu.func @kernel_with_block_size() kernel attributes {gpu.known_block_size = array<i32: 128, 1, 1>} {
+    gpu.return
+  }
+}
+
+
 module attributes {transform.with_named_sequence} {
   transform.named_sequence @__transform_main(%toplevel_module: !transform.any_op {transform.readonly}) {
     %gpu_module = transform.structured.match ops{["gpu.module"]} in %toplevel_module

llvmbot · 2024-01-08T12:46:07Z

@llvm/pr-subscribers-mlir

joker-eph · 2024-01-08T12:54:05Z

mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp

+        // If any of the dimensions are missing, fill them in with 1.
+        attributes.emplace_back(
+            kernelBlockSizeAttributeName.value(),
+            rewriter.getI32ArrayAttr(


Can we use a DenseArray?

grypp · 2024-01-08

Sure, I can use DenseArray, and I think I should.

It needs some adjustments here and changes to some tests, so I can do it in a separate PR if that's okay.

krzysz00 · 2024-09-06T21:06:38Z

@grypp Having been poking around all the relevant code ... shouldn't `known_block_size` correspond to `reqntid` and not `maxntid`? That annotation is an array of the statically-known block sizes that will be used at launch.

(By analogy, the ROCDL version uses !reqd_work_group_size, not any of the weaker properties)
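To make the distinction in this question concrete, here is a rough model of the two constraints as the PTX ISA describes them: `.maxntid` bounds the total thread count by the product of the declared extents, while `.reqntid` requires every launch dimension to match exactly. This is a sketch of the semantics only, not code from MLIR or the NVPTX backend, and `Dim3`, `satisfiesMaxntid` and `satisfiesReqntid` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Dim3 = std::array<int64_t, 3>;

// Total number of threads described by a block shape.
int64_t threadCount(const Dim3 &d) {
  assert(d[0] > 0 && d[1] > 0 && d[2] > 0 && "extents must be positive");
  return d[0] * d[1] * d[2];
}

// maxntid-style constraint: the launch may use any shape whose total
// thread count does not exceed the product of the declared extents.
bool satisfiesMaxntid(const Dim3 &declared, const Dim3 &launch) {
  return threadCount(launch) <= threadCount(declared);
}

// reqntid-style constraint: every launch dimension must equal the
// declared one.
bool satisfiesReqntid(const Dim3 &declared, const Dim3 &launch) {
  return declared == launch;
}
```

Under this model a kernel annotated with `[128, 1, 1]` may still be launched with 64 threads under the `maxntid` rule, whereas the `reqntid` rule accepts only `128 x 1 x 1`, which is the stronger guarantee the comment argues `known_block_size` already provides.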

llvmbot added mlir:gpu mlir labels Jan 8, 2024

grypp requested review from ftynse, matthias-springer, jpienaar and joker-eph January 8, 2024 12:47

apaszke approved these changes Jan 8, 2024

joker-eph reviewed Jan 8, 2024

joker-eph approved these changes Jan 8, 2024

grypp merged commit 763109e into llvm:main Jan 8, 2024
6 checks passed

grypp deleted the set-maxncta branch January 8, 2024 13:49

grypp mentioned this pull request Jan 9, 2024

[mlir][gpu] Use DenseI32Array for NVVM's maxntid and reqntid (NFC) #77466

Merged
